Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add safe request timeout for running actions manager #743

Merged
merged 1 commit into from
Apr 8, 2024

Conversation

zbirenbaum
Copy link
Contributor

@zbirenbaum zbirenbaum commented Mar 10, 2024

Description

Backport of running actions manager timeout changes which checks the timeout for an action and defaults to a maximum if not set.

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist

  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained

a discussion (no related file):
lgtm, but we need a test.


Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: 0 of 1 LGTMs obtained

Copy link
Contributor Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: pre-commit-checks

a discussion (no related file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

lgtm, but we need a test.

Test has been added


Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: pre-commit-checks

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r4, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: windows-2022 / stable

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: windows-2022 / stable


nativelink-worker/tests/running_actions_manager_test.rs line 2660 at r4 (raw file):

    #[tokio::test]
    async fn running_actions_manager_respects_action_timeout() -> Result<(), Box<dyn std::error::Error>> {
        #[cfg(target_family = "unix")]

nit: This should probably be removed.


nativelink-worker/tests/running_actions_manager_test.rs line 2698 at r4 (raw file):

        )
        .await?;
        const TASK_TIMEOUT: Duration = Duration::from_secs(1);

Tests should not use wall-time. In this case, you probably want to use RunningActionsManagerImpl::new_with_timeout or something to give test control of the timeout function, then use some signaling to facilitate this test.

@zbirenbaum zbirenbaum force-pushed the v0.1-timeout-changes branch 3 times, most recently from 2992810 to 1fbcf47 Compare March 25, 2024 20:28
Copy link
Contributor Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Remote / large-ubuntu-22.04, Vercel, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), pre-commit-checks, publish-image, ubuntu-20.04 / stable, ubuntu-22.04, vale, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04


nativelink-worker/tests/running_actions_manager_test.rs line 2698 at r4 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Tests should not use wall-time. In this case, you probably want to use RunningActionsManagerImpl::new_with_timeout or something to give test control of the timeout function, then use some signaling to facilitate this test.

I think the easiest way to do this without actually waiting is to set both the timeout and max_timeout to 0, at which point the checks for time elapsed will just always return true.

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r5, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: windows-2022 / stable

@allada allada force-pushed the main branch 2 times, most recently from a2a06c2 to 09defc6 Compare March 26, 2024 18:19
Copy link

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r6.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: windows-2022 / stable, and 3 discussions need to be resolved


nativelink-worker/tests/running_actions_manager_test.rs line 2698 at r4 (raw file):

Previously, zbirenbaum (Zach Birenbaum) wrote…

I think the easiest way to do this without actually waiting is to set both the timeout and max_timeout to 0, at which point the checks for time elapsed will just always return true.

I'm not a huge fan of doing this, it makes future bugs even more difficult to write tests for and people will end up copy-pasta the pattern around.

It should be fairly trivial to hand in a custom sleep function and allow the test to override it here, we do this in quite a few places.

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r6, all commit messages.
Reviewable status: 1 of 1 LGTMs obtained, and pending CI: windows-2022 / stable, and 3 discussions need to be resolved

@zbirenbaum zbirenbaum force-pushed the v0.1-timeout-changes branch 4 times, most recently from 10e5ba3 to 00a1250 Compare April 1, 2024 06:57
Copy link
Contributor Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Bazel Dev / ubuntu-22.04, Publish image, Vercel, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, and 2 discussions need to be resolved


nativelink-worker/tests/running_actions_manager_test.rs line 2660 at r4 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: This should probably be removed.

Done.


nativelink-worker/tests/running_actions_manager_test.rs line 2698 at r4 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

I'm not a huge fan of doing this, it makes future bugs even more difficult to write tests for and people will end up copy-pasta the pattern around.

It should be fairly trivial to hand in a custom sleep function and allow the test to override it here, we do this in quite a few places.

I've updated it to take a custom sleep function which will immediately return if the correct timeout is passed and stall indefinitely causing the action to complete and the test to fail otherwise.

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 2 files at r7, 1 of 1 files at r8, all commit messages.
Reviewable status: 2 of 1 LGTMs obtained, and pending CI: windows-2022 / stable, and 2 discussions need to be resolved


nativelink-worker/tests/running_actions_manager_test.rs line 2944 at r8 (raw file):

        fs::create_dir_all(&root_work_directory).await?;

        // ignore the sleep and immediately timeout

nit: Caps and period.


nativelink-worker/tests/running_actions_manager_test.rs line 2973 at r8 (raw file):

                // otherwise return pending and fail the test.
                sleep_fn: |duration| {
                    if duration.as_secs() == ACTION_TIMEOUT as u64 {

nit: For readability, assert that duration is ACTION_TIMEOUT and then return ready.

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 2 of 1 LGTMs obtained, and pending CI: windows-2022 / stable, and 2 discussions need to be resolved

Copy link
Contributor Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 1 LGTMs obtained, and pending CI: Vercel, pre-commit-checks


nativelink-worker/tests/running_actions_manager_test.rs line 2944 at r8 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: Caps and period.

Done.


nativelink-worker/tests/running_actions_manager_test.rs line 2973 at r8 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

nit: For readability, assert that duration is ACTION_TIMEOUT and then return ready.

Done.

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r10, all commit messages.
Reviewable status: 2 of 1 LGTMs obtained, and pending CI: windows-2022 / stable

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r11, all commit messages.
Reviewable status: 2 of 1 LGTMs obtained, and pending CI: integration-tests (20.04), integration-tests (22.04), windows-2022 / stable

@zbirenbaum zbirenbaum force-pushed the v0.1-timeout-changes branch 2 times, most recently from 58c0060 to 9a81488 Compare April 3, 2024 23:28
Copy link
Contributor Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r12, all commit messages.
Reviewable status: 2 of 1 LGTMs obtained, and pending CI: Analyze (javascript-typescript), Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-lre-cc, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-20.04, zig-cc ubuntu-22.04

Copy link
Contributor Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 1 LGTMs obtained, and pending CI: integration-tests (20.04), integration-tests (22.04), windows-2022 / stable, and 1 discussions need to be resolved


nativelink-worker/tests/running_actions_manager_test.rs line 3259 at r12 (raw file):

        let arguments = vec!["sh".to_string(), "-c".to_string(), "sleep 2".to_string()];
        #[cfg(target_family = "windows")]
        let arguments = vec![

This is what is causing the test to fail. It is returning error code 1 which is incorrect function. I have been reading up on it but can't figure it out so I'm just going to remove to remove this test for windows if that's alright @allada

Copy link
Member

@adam-singer adam-singer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 2 of 1 LGTMs obtained, and pending CI: integration-tests (20.04), integration-tests (22.04), windows-2022 / stable, and 1 discussions need to be resolved

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 1 LGTMs obtained, and pending CI: integration-tests (20.04), integration-tests (22.04), windows-2022 / stable, and 1 discussions need to be resolved


nativelink-worker/tests/running_actions_manager_test.rs line 3259 at r12 (raw file):

Previously, zbirenbaum (Zach Birenbaum) wrote…

This is what is causing the test to fail. It is returning error code 1 which is incorrect function. I have been reading up on it but can't figure it out so I'm just going to remove to remove this test for windows if that's alright @allada

Why not down below set the error code on windows to expect exit code 1?

Copy link
Contributor Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 1 LGTMs obtained, and pending CI: integration-tests (20.04), integration-tests (22.04), windows-2022 / stable, and 1 discussions need to be resolved


nativelink-worker/tests/running_actions_manager_test.rs line 3259 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

Why not down below set the error code on windows to expect exit code 1?

Judging by the docs it seems like exit code 1 corresponds to something else that wouldn't actually be showing the proper behavior here. It's difficult to tell so I guess its possible that's the expected error code but I'm not certain. If not being sure is preferable to removing the test we can do that

Copy link
Member

@allada allada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 1 LGTMs obtained, and pending CI: integration-tests (20.04), integration-tests (22.04), windows-2022 / stable, and 1 discussions need to be resolved


nativelink-worker/tests/running_actions_manager_test.rs line 3259 at r12 (raw file):

Previously, zbirenbaum (Zach Birenbaum) wrote…

Judging by the docs it seems like exit code 1 corresponds to something else that wouldn't actually be showing the proper behavior here. It's difficult to tell so I guess its possible that's the expected error code but I'm not certain. If not being sure is preferable to removing the test we can do that

This test is not actually testing the error code, it's checking that the timeout triggers. The error code is a side-effect. In this case, I would just make a TODO and keep the test for windows.

Changes running actions manager to use the action timeout if set to a
nonzero value for deciding when to kill a process, otherwise defaults to
its configured max_action_timeout
Copy link
Contributor Author

@zbirenbaum zbirenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 1 LGTMs obtained, and pending CI: Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Remote / large-ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), ubuntu-20.04 / stable, ubuntu-22.04 / stable, windows-2022 / stable, zig-cc ubuntu-22.04


nativelink-worker/tests/running_actions_manager_test.rs line 3259 at r12 (raw file):

Previously, allada (Nathan (Blaise) Bruer) wrote…

This test is not actually testing the error code, it's checking that the timeout triggers. The error code is a side-effect. In this case, I would just make a TODO and keep the test for windows.

Done.

@zbirenbaum zbirenbaum merged commit 33db963 into TraceMachina:main Apr 8, 2024
26 checks passed
@zbirenbaum zbirenbaum deleted the v0.1-timeout-changes branch April 8, 2024 21:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants